Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support iter_epochs for Datasets #19217

Merged
merged 19 commits into from
Oct 12, 2021
Merged

Support iter_epochs for Datasets #19217

merged 19 commits into from
Oct 12, 2021

Conversation

ericl
Copy link
Contributor

@ericl ericl commented Oct 8, 2021

Why are these changes needed?

This PR adds support for splitting a DatasetPipeline into epochs to iterate over, a common ML task. An epoch is represented by a DatasetPipeline limited to data from that epoch only. Epochs are implicitly defined by calls to .repeat(), where an epoch is one repetition of the data, though the mechanism is extensible so we could support customizable epochs in the future.

@ericl ericl changed the title [WIP] Support iter_epochs for Datasets Support iter_epochs for Datasets Oct 8, 2021
@ericl ericl added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Oct 8, 2021
@ericl
Copy link
Contributor Author

ericl commented Oct 8, 2021

ptal

@amogkam
Copy link
Contributor

amogkam commented Oct 8, 2021

@ericl, just to confirm is the behavior the same even if the pipeline structure is changed before/after the repeat call (for example on a rewindow)?

Also, do we want to not even allow iter_epochs to be called if repeat has not been called on the Dataset?

@ericl
Copy link
Contributor Author

ericl commented Oct 8, 2021

@ericl, just to confirm is the behavior the same even if the pipeline structure is changed before/after the repeat call (for example on a rewindow)?

That's right, (re)window does not change epochs (though, it is possible re-windowed windows span epochs--- in that case the max epoch of the window is used as the epoch number).

Also, do we want to not even allow iter_epochs to be called if repeat has not been called on the Dataset?

I tried adding this earlier, but it was a bit tricky. Also, I think it's better if the user can use iter_epochs() safely even if there isn't repeat (e.g., so SGDv2 can always use iter_epochs(), regardless of whether the user repeated the data).

python/ray/data/dataset.py Show resolved Hide resolved
python/ray/data/dataset_pipeline.py Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
@ericl ericl requested a review from scv119 as a code owner October 11, 2021 22:29
@ericl
Copy link
Contributor Author

ericl commented Oct 11, 2021

@clarkzinzow good catch on the more_itertools dependency, I didn't realize that was not builtin. Removed it with a custom class.

@ericl ericl removed the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Oct 11, 2021
@ericl ericl merged commit 0ab6749 into ray-project:master Oct 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants